The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its
enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach 
to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep 
neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement 
learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art
Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a
new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, 
our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go 
champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the 
full-sized game of Go, a feat previously thought to be at least a decade away. 


All games of perfect information have an optimal value function, v*(s),
which determines the outcome of the game, from every board position
or state s, under perfect play by all players. These games may be solved
by recursively computing the optimal value function in a search tree
containing approximately b^d possible sequences of moves, where b is
the game's breadth (number of legal moves per position) and d is its
depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80)¹ and
especially Go (b ≈ 250, d ≈ 150)¹, exhaustive search is infeasible²,³, but
the effective search space can be reduced by two general principles.
First, the depth of the search may be reduced by position evaluation:
truncating the search tree at state s and replacing the subtree below s
by an approximate value function v(s) ≈ v*(s) that predicts the outcome
from state s. This approach has led to superhuman performance in
chess⁴, checkers⁵ and othello⁶, but it was believed to be intractable in Go
due to the complexity of the game⁷. Second, the breadth of the search
may be reduced by sampling actions from a policy p(a|s) that is a
probability distribution over possible moves a in position s. For example,
Monte Carlo rollouts⁸ search to maximum depth without branching
at all, by sampling long sequences of actions for both players from a
policy p. Averaging over such rollouts can provide an effective position
evaluation, achieving superhuman performance in backgammon⁸ and
Scrabble⁹, and weak amateur level play in Go¹⁰.
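
To give a feel for the scale of b^d, the following minimal sketch computes the order of magnitude of the search tree for the breadth and depth figures quoted above. The helper name `log10_tree_size` is ours and purely illustrative; working in log10 avoids printing integers with hundreds of digits.

```python
import math

def log10_tree_size(breadth: float, depth: int) -> float:
    """Return log10(b**d), the order of magnitude of the number of move sequences."""
    return depth * math.log10(breadth)

# Breadth and depth values are the approximate figures quoted in the text.
for name, b, d in [("chess", 35, 80), ("Go", 250, 150)]:
    print(f"{name}: b^d has about {log10_tree_size(b, d):.1f} decimal orders of magnitude")
```

The gap of more than two hundred orders of magnitude between the two games is why exhaustive search is not an option in Go.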

Monte Carlo tree search (MCTS)¹¹,¹² uses Monte Carlo rollouts
to estimate the value of each state in a search tree. As more
simulations are executed, the search tree grows larger and the relevant
values become more accurate. The policy used to select actions during
search is also improved over time, by selecting children with higher
values. Asymptotically, this policy converges to optimal play, and the
evaluations converge to the optimal value function¹². The strongest
current Go programs are based on MCTS, enhanced by policies that
are trained to predict human expert moves¹³. These policies are used
to narrow the search to a beam of high-probability actions, and to
sample actions during rollouts. This approach has achieved strong
amateur play¹³⁻¹⁵. However, prior work has been limited to shallow
policies¹³⁻¹⁵ or value functions¹⁶ based on a linear combination of
input features.

Recently, deep convolutional neural networks have achieved
unprecedented performance in visual domains: for example, image
classification¹⁷, face recognition¹⁸, and playing Atari games¹⁹. They use many
layers of neurons, each arranged in overlapping tiles, to construct
increasingly abstract, localized representations of an image²⁰. We
employ a similar architecture for the game of Go. We pass in the board 
position as a 19 x 19 image and use convolutional layers to construct a 
representation of the position. We use these neural networks to reduce 
the effective depth and breadth of the search tree: evaluating positions 
using a value network, and sampling actions using a policy network. 

We train the neural networks using a pipeline consisting of several 
stages of machine learning (Fig. 1). We begin by training a supervised
learning (SL) policy network p_σ directly from expert human moves.
This provides fast, efficient learning updates with immediate feedback
and high-quality gradients. Similar to prior work¹³,¹⁵, we also train a
fast policy p_π that can rapidly sample actions during rollouts. Next, we
train a reinforcement learning (RL) policy network p_ρ that improves
the SL policy network by optimizing the final outcome of games of
self-play. This adjusts the policy towards the correct goal of winning games,
rather than maximizing predictive accuracy. Finally, we train a value
network v_θ that predicts the winner of games played by the RL policy
network against itself. Our program AlphaGo efficiently combines the
policy and value networks with MCTS. 


Supervised learning of policy networks 

For the first stage of the training pipeline, we build on prior work 
on predicting expert moves in the game of Go using supervised 
learning¹³,²¹⁻²⁴. The SL policy network p_σ(a|s) alternates between
convolutional layers with weights σ, and rectifier nonlinearities. A final
softmax layer outputs a probability distribution over all legal moves a. The
input s to the policy network is a simple representation of the board state 
(see Extended Data Table 2). The policy network is trained on randomly 


1Google DeepMind, 5 New Street Square, London EC4A 3TW, UK. 2Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 


*These authors contributed equally to this work. 


484 | NATURE | VOL 529 | 28 JANUARY 2016 


© 2016 Macmillan Publishers Limited. All rights reserved 


[Figure 1 diagram. Panel a: rollout policy, SL policy network, RL policy network and value network, with policy gradient training and human expert positions as input data. Panel b: policy network and value network.]
Figure 1 | Neural network training pipeline and architecture. a, A fast
rollout policy p_π and supervised learning (SL) policy network p_σ are
trained to predict human expert moves in a data set of positions.
A reinforcement learning (RL) policy network p_ρ is initialized to the SL
policy network, and is then improved by policy gradient learning to
maximize the outcome (that is, winning more games) against previous
versions of the policy network. A new data set is generated by playing
games of self-play with the RL policy network. Finally, a value network v_θ
is trained by regression to predict the expected outcome (that is, whether


sampled state-action pairs (s, a), using stochastic gradient ascent to 
maximize the likelihood of the human move a selected in state s 


$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$$
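
As an illustration of this update, here is a minimal sketch of one stochastic gradient ascent step on the log-likelihood of an expert move. It assumes a linear softmax policy over a small hand-made feature vector rather than the convolutional network described in the text; the function names, learning rate and toy data are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sl_update(weights, features, expert_move, lr=0.1):
    """One gradient ascent step on log p(expert_move | s) for a linear softmax policy.

    weights[b] is the weight vector for move b; features describes state s.
    For this model, d log p(a|s) / d weights[b] = (1[a == b] - p(b|s)) * features.
    Returns the probability of the expert move before the update.
    """
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    for b, row in enumerate(weights):
        coeff = (1.0 if b == expert_move else 0.0) - probs[b]
        for j, x in enumerate(features):
            row[j] += lr * coeff * x
    return probs[expert_move]

# Toy problem (assumed): 3 candidate moves, 2 features, expert plays move 2.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
state = [1.0, 0.5]
before = sl_update(weights, state, expert_move=2)
after = sl_update(weights, state, expert_move=2)
print(before, after)
```

Each step raises the probability assigned to the expert's move in that state, which is the sense in which the update maximizes likelihood.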
We trained a 13-layer policy network, which we call the SL policy
network, from 30 million positions from the KGS Go Server. The
network predicted expert moves on a held out test set with an accuracy of
57.0% using all input features, and 55.7% using only raw board
position and move history as inputs, compared to the state-of-the-art from
other research groups of 44.4% at date of submission²⁴ (full results in
Extended Data Table 3). Small improvements in accuracy led to large
improvements in playing strength (Fig. 2a); larger networks achieve
better accuracy but are slower to evaluate during search. We also
trained a faster but less accurate rollout policy p_π(a|s), using a linear
softmax of small pattern features (see Extended Data Table 4) with
weights π; this achieved an accuracy of 24.2%, using just 2 μs to select
an action, rather than 3 ms for the policy network.




Reinforcement learning of policy networks 

The second stage of the training pipeline aims at improving the policy 
network by policy gradient reinforcement learning (RL). The RL 
policy network p_ρ is identical in structure to the SL policy network,

Figure 2 | Strength and accuracy of policy and value networks.
a, Plot showing the playing strength of policy networks as a function
of their training accuracy. Policy networks with 128, 192, 256 and 384
convolutional filters per layer were evaluated periodically during training;
the plot shows the winning rate of AlphaGo using that policy network
against the match version of AlphaGo. b, Comparison of evaluation
accuracy between the value network and rollouts with different policies.


the current player wins) in positions from the self-play data set.
b, Schematic representation of the neural network architecture used in
AlphaGo. The policy network takes a representation of the board position
s as its input, passes it through many convolutional layers with parameters
σ (SL policy network) or ρ (RL policy network), and outputs a probability
distribution p_σ(a|s) or p_ρ(a|s) over legal moves a, represented by a
probability map over the board. The value network similarly uses many
convolutional layers with parameters θ, but outputs a scalar value v_θ(s′)
that predicts the expected outcome in position s′.


and its weights ρ are initialized to the same values, ρ = σ. We play
games between the current policy network p_ρ and a randomly selected
previous iteration of the policy network. Randomizing from a pool
of opponents in this way stabilizes training by preventing overfitting
to the current policy. We use a reward function r(s) that is zero for all
non-terminal time steps t < T. The outcome z_t = ±r(s_T) is the
terminal reward at the end of the game from the perspective of the current
player at time step t: +1 for winning and −1 for losing. Weights are
then updated at each time step t by stochastic gradient ascent in the
direction that maximizes expected outcome²⁵


$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t$$
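
The role of the outcome z_t in this update can be seen in a small sketch. It assumes a toy, state-independent softmax policy whose parameters ρ are one logit per move, which is far simpler than the network in the text; `reinforce_update` and the learning rate are illustrative names and values, not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(rho, actions, z, lr=0.1):
    """Apply rho += lr * d log p(a_t) / d rho * z for every move of one game.

    rho: list of logits, one per move (toy policy, assumed)
    actions: moves the current player chose during the game
    z: +1 if that player won the game, -1 if it lost
    """
    for a in actions:
        probs = softmax(rho)
        for b in range(len(rho)):
            grad_log_prob = (1.0 if b == a else 0.0) - probs[b]
            rho[b] += lr * z * grad_log_prob
    return rho

won = reinforce_update([0.0, 0.0, 0.0], actions=[0], z=+1)
lost = reinforce_update([0.0, 0.0, 0.0], actions=[0], z=-1)
print(softmax(won)[0], softmax(lost)[0])
```

Moves played in won games become more likely and moves played in lost games become less likely; the gradient term is the same as in supervised learning, only scaled by the game result.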
We evaluated the performance of the RL policy network in game
play, sampling each move a_t ~ p_ρ(·|s_t) from its output probability
distribution over actions. When played head-to-head, the RL policy
network won more than 80% of games against the SL policy network.
We also tested against the strongest open-source Go program, Pachi¹⁴,
a sophisticated Monte Carlo search program, ranked at 2 amateur dan
on KGS, that executes 100,000 simulations per move. Using no search
at all, the RL policy network won 85% of games against Pachi. In
comparison, the previous state-of-the-art, based only on supervised
Positions and outcomes were sampled from human expert games. Each
position was evaluated by a single forward pass of the value network v_θ,
or by the mean outcome of 100 rollouts, played out using either uniform
random rollouts, the fast rollout policy p_π, the SL policy network p_σ or
the RL policy network p_ρ. The mean squared error between the predicted
value and the actual game outcome is plotted against the stage of the game
(how many moves had been played in the given position).
Figure 3 | Monte Carlo tree search in AlphaGo. a, Each simulation
traverses the tree by selecting the edge with maximum action value Q,
plus a bonus u(P) that depends on a stored prior probability P for that
edge. b, The leaf node may be expanded; the new node is processed once
by the policy network p_σ and the output probabilities are stored as prior
probabilities P for each action. c, At the end of a simulation, the leaf node


learning of convolutional networks, won 11% of games against Pachi²³
and 12% against a slightly weaker program, Fuego²⁴.


Reinforcement learning of value networks 

The final stage of the training pipeline focuses on position evaluation, 
estimating a value function v^p(s) that predicts the outcome from
position s of games played by using policy p for both players²⁸⁻³⁰

$$v^p(s) = \mathbb{E}\left[z_t \mid s_t = s,\ a_{t \ldots T} \sim p\right]$$


Ideally, we would like to know the optimal value function under
perfect play v*(s); in practice, we instead estimate the value function
v^{p_ρ} for our strongest policy, using the RL policy network p_ρ. We
approximate the value function using a value network v_θ(s) with weights θ,
v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s). This neural network has a similar architecture
to the policy network, but outputs a single prediction instead of a
probability distribution. We train the weights of the value network by
regression on state-outcome pairs (s, z), using stochastic gradient descent to
minimize the mean squared error (MSE) between the predicted value
v_θ(s), and the corresponding outcome z


$$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\bigl(z - v_\theta(s)\bigr)$$
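
The regression update can be sketched on a toy model. The example below assumes a value function of the form tanh(θ · features), chosen only so that predictions stay in (−1, 1); the model, the feature vector and the learning rate are illustrative assumptions and not the value network architecture used in the paper.

```python
import math

def value(theta, features):
    """Toy value function: tanh of a linear score, so outputs lie in (-1, 1)."""
    return math.tanh(sum(t * x for t, x in zip(theta, features)))

def value_update(theta, features, z, lr=0.1):
    """One SGD step: theta += lr * dv/dtheta * (z - v).

    This descends the squared error 0.5 * (z - v)^2.
    Returns the squared error measured before the update.
    """
    v = value(theta, features)
    dv_dscore = 1.0 - v * v  # derivative of tanh at the current score
    for j, x in enumerate(features):
        theta[j] += lr * (z - v) * dv_dscore * x
    return (z - v) ** 2

theta = [0.0, 0.0]
position = [1.0, -0.5]  # hypothetical feature vector for one position
errors = [value_update(theta, position, z=1.0) for _ in range(50)]
print(errors[0], errors[-1])
```

Repeating the step on a single (s, z) pair drives the prediction towards the observed outcome; the overfitting issue discussed next arises when many highly correlated positions share one outcome.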


The naive approach of predicting game outcomes from data con- 
sisting of complete games leads to overfitting. The problem is that 
successive positions are strongly correlated, differing by just one stone, 
but the regression target is shared for the entire game. When trained 
on the KGS data set in this way, the value network memorized the 
game outcomes rather than generalizing to new positions, achieving a 
minimum MSE of 0.37 on the test set, compared to 0.19 on the training 
set. To mitigate this problem, we generated a new self-play data set 
consisting of 30 million distinct positions, each sampled from a sepa- 
rate game. Each game was played between the RL policy network and 
itself until the game terminated. Training on this data set led to MSEs 
of 0.226 and 0.234 on the training and test set respectively, indicating 
minimal overfitting. Figure 2b shows the position evaluation accuracy
of the value network, compared to Monte Carlo rollouts using the fast
rollout policy p_π; the value function was consistently more accurate.
A single evaluation of v_θ(s) also approached the accuracy of Monte
Carlo rollouts using the RL policy network p_ρ, but using 15,000 times
less computation.
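
The decorrelation idea described above, taking a single position from each game, can be sketched as follows. This is a simplified illustration under our own assumptions: games are given as lists of opaque position labels with one shared outcome, the helper name is hypothetical, and the sketch ignores how the games themselves are generated and from whose perspective the outcome is scored.

```python
import random

def sample_one_position_per_game(games, rng):
    """Return one (position, outcome) pair per game.

    games: list of (positions, outcome) tuples, where positions is the list of
           states visited in one game and outcome is that game's final result z.
    rng:   a random.Random instance, passed in for reproducibility.
    """
    dataset = []
    for positions, outcome in games:
        dataset.append((rng.choice(positions), outcome))
    return dataset

# Two hypothetical games with placeholder position labels.
games = [(["g0_s0", "g0_s1", "g0_s2"], +1), (["g1_s0", "g1_s1"], -1)]
data = sample_one_position_per_game(games, random.Random(0))
print(data)
```

Because no two training examples come from the same game, the network cannot lower its training error simply by recognising which game a position belongs to.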


Searching with policy and value networks 
AlphaGo combines the policy and value networks in an MCTS algo- 
rithm (Fig. 3) that selects actions by lookahead search. Each edge 
is evaluated in two ways: using the value network v_θ; and by running
a rollout to the end of the game with the fast rollout policy p_π, then
computing the winner with function r. d, Action values Q are updated to
track the mean value of all evaluations r(·) and v_θ(·) in the subtree below
that action.


(s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), 
and prior probability P(s, a). The tree is traversed by simulation (that 
is, descending the tree in complete games without backup), starting 
from the root state. At each time step t of each simulation, an action a_t
is selected from state s_t


$$a_t = \underset{a}{\operatorname{argmax}}\,\bigl(Q(s_t, a) + u(s_t, a)\bigr)$$

so as to maximize action value plus a bonus

$$u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$

that is proportional to the prior probability but decays with
repeated visits to encourage exploration. When the traversal reaches a
leaf node s_L at step L, the leaf node may be expanded. The leaf position
s_L is processed just once by the SL policy network p_σ. The output
probabilities are stored as prior probabilities P for each legal action a,
P(s, a) = p_σ(a|s). The leaf node is evaluated in two very different ways:
first, by the value network v_θ(s_L); and second, by the outcome z_L of a
random rollout played out until terminal step T using the fast rollout
policy p_π; these evaluations are combined, using a mixing parameter
λ, into a leaf evaluation V(s_L)

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$


At the end of simulation, the action values and visit counts of all 
traversed edges are updated. Each edge accumulates the visit count and 
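mean evaluation of all simulations passing through that edge

$$N(s, a) = \sum_{i=1}^{n} 1(s, a, i)$$

$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} 1(s, a, i)\, V(s_L^i)$$

where s_L^i is the leaf node from the ith simulation, and 1(s, a, i) indicates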
whether an edge (s, a) was traversed during the ith simulation. Once 
the search is complete, the algorithm chooses the most visited move 
from the root position. 
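
The selection, leaf evaluation and backup steps described in this section can be sketched on a single node. The code below is a minimal illustration under stated assumptions: the proportionality constant in the bonus is taken to be `c = 1.0`, values are not sign-flipped between the two players, and the `Edge` class and function names are ours rather than AlphaGo's.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # P(s, a), from the policy network
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # sum of leaf evaluations V(s_L) seen through this edge

    @property
    def q(self) -> float:
        """Mean evaluation Q(s, a); zero before the first visit."""
        return self.total_value / self.visits if self.visits else 0.0

def select_action(edges, c=1.0):
    """Pick argmax_a Q(s, a) + u(s, a), with u = c * P / (1 + N)."""
    return max(edges, key=lambda a: edges[a].q + c * edges[a].prior / (1 + edges[a].visits))

def leaf_value(v_theta, z_rollout, lam=0.5):
    """Mix value network output and rollout outcome: (1 - lam) * v + lam * z."""
    return (1 - lam) * v_theta + lam * z_rollout

def backup(path, value):
    """Update visit count and accumulated value on every traversed edge."""
    for edge in path:
        edge.visits += 1
        edge.total_value += value

def most_visited(edges):
    """Final move choice: the action with the highest visit count."""
    return max(edges, key=lambda a: edges[a].visits)

edges = {"A": Edge(prior=0.6), "B": Edge(prior=0.4)}
first = select_action(edges)   # with no visits, the larger prior decides
backup([edges[first]], leaf_value(v_theta=-0.8, z_rollout=-1.0))
second = select_action(edges)  # a poor evaluation shifts the search to the other move
print(first, second, most_visited(edges))
```

The example shows the intended dynamics: the prior steers early simulations, and accumulated evaluations take over as visit counts grow.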

It is worth noting that the SL policy network p_σ performed better in
AlphaGo than the stronger RL policy network p_ρ, presumably because
humans select a diverse beam of promising moves, whereas RL
optimizes for the single best move. However, the value function
v_θ(s) ≈ v^{p_ρ}(s) derived from the stronger RL policy network performed
Figure 4 | Tournament evaluation of AlphaGo. a, Results of a
tournament between different Go programs (see Extended Data Tables
6-11). Each program used approximately 5 s computation time per move.
To provide a greater challenge to AlphaGo, some programs (pale upper
bars) were given four handicap stones (that is, free moves at the start of
every game) against all opponents. Programs were evaluated on an
Elo scale³⁷: a 230 point gap corresponds to a 79% probability of winning,
which roughly corresponds to one amateur dan rank advantage on
KGS³⁸; an approximate correspondence to human ranks is also shown,

better in AlphaGo than a value function v_θ(s) ≈ v^{p_σ}(s) derived from the
SL policy network. 

Evaluating policy and value networks requires several orders of 
magnitude more computation than traditional search heuristics. To 
efficiently combine MCTS with deep neural networks, AlphaGo uses 
an asynchronous multi-threaded search that executes simulations on 
CPUs, and computes policy and value networks in parallel on GPUs. 
The final version of AlphaGo used 40 search threads, 48 CPUs, and 
8 GPUs. We also implemented a distributed version of AlphaGo that 


horizontal lines show KGS ranks achieved online by that program. Games
against the human European champion Fan Hui were also included; 

these games used longer time controls. 95% confidence intervals are 
shown. b, Performance of AlphaGo, on a single machine, for different 
combinations of components. The version solely using the policy network 
does not perform any search. c, Scalability study of MCTS in AlphaGo 
with search threads and GPUs, using asynchronous search (light blue) or 
distributed search (dark blue), for 2 s per move. 


exploited multiple machines, 40 search threads, 1,202 CPUs and 
176 GPUs. The Methods section provides full details of asynchronous 
and distributed MCTS. 
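
The Fig. 4 caption equates a 230 point Elo gap with a 79% probability of winning. A quick check, assuming the conventional base-10 logistic Elo curve with a 400-point scale (an assumption on our part; the rating method actually used is the one cited in the caption), reproduces that figure. The function name is illustrative.

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected win probability for the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

print(f"{elo_win_probability(230):.3f}")
```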


Evaluating the playing strength of AlphaGo 

To evaluate AlphaGo, we ran an internal tournament among variants 
of AlphaGo and several other Go programs, including the strongest 
commercial programs Crazy Stone¹³ and Zen, and the strongest open
source programs Pachi¹⁴ and Fuego¹⁵. All of these programs are based




Figure 5 | How AlphaGo (black, to play) selected its move in an 
informal game against Fan Hui. For each of the following statistics, 
the location of the maximum value is indicated by an orange circle. 

a, Evaluation of all successors s′ of the root position s, using the value
network v_θ(s′); estimated winning percentages are shown for the top
evaluations. b, Action values Q(s, a) for each edge (s, a) in the tree from
root position s; averaged over value network evaluations only (λ = 0).
c, Action values Q(s, a), averaged over rollout evaluations only (λ = 1).
d, Move probabilities directly from the SL policy network, p_σ(a|s);
reported as a percentage (if above 0.1%). e, Percentage frequency with 
which actions were selected from the root during simulations. f, The 
principal variation (path with maximum visit count) from AlphaGo's 
search tree. The moves are presented in a numbered sequence. AlphaGo 
selected the move indicated by the red circle; Fan Hui responded with the 
move indicated by the white square; in his post-game commentary he 
preferred the move (labelled 1) predicted by AlphaGo. 
Game 1: Fan Hui (Black), AlphaGo (White). AlphaGo wins by 2.5 points.
Game 2: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 3: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.
Game 4: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 5: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.

Figure 6 | Games from the match between AlphaGo and the European 
champion, Fan Hui. Moves are shown in a numbered sequence 
corresponding to the order in which they were played. Repeated moves 
on the same intersection are shown in pairs below the board. The first 


on high-performance MCTS algorithms. In addition, we included the 
open source program GnuGo, a Go program using state-of-the-art 
search methods that preceded MCTS. All programs were allowed 5 s 
of computation time per move. 

The results of the tournament (see Fig. 4a) suggest that single- 
machine AlphaGo is many dan ranks stronger than any previous 
Go program, winning 494 out of 495 games (99.8%) against other 
Go programs. To provide a greater challenge to AlphaGo, we also 
played games with four handicap stones (that is, free moves for the 
opponent); AlphaGo won 77%, 86%, and 99% of handicap games 
against Crazy Stone, Zen and Pachi, respectively. The distributed ver- 
sion of AlphaGo was significantly stronger, winning 77% of games 
against single-machine AlphaGo and 100% of its games against other 
programs. 

We also assessed variants of AlphaGo that evaluated positions 
using just the value network (λ = 0) or just rollouts (λ = 1) (see
Fig. 4b). Even without rollouts AlphaGo exceeded the performance
of all other Go programs, demonstrating that value networks provide
a viable alternative to Monte Carlo evaluation in Go. However, the
mixed evaluation (λ = 0.5) performed best, winning >95% of games
against other variants. This suggests that the two position-evaluation 
move number in each pair indicates when the repeat move was played, at
an intersection identified by the second move number (see Supplementary
Information).


mechanisms are complementary: the value network approximates the 
outcome of games played by the strong but impractically slow p_ρ, while
the rollouts can precisely score and evaluate the outcome of games
played by the weaker but faster rollout policy p_π. Figure 5 visualizes
the evaluation of a real game position by AlphaGo. 

Finally, we evaluated the distributed version of AlphaGo against Fan 
Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 
European Go championships. Over 5-9 October 2015 AlphaGo and 
Fan Hui competed in a formal five-game match. AlphaGo won the 
match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the 
first time that a computer Go program has defeated a human
professional player, without handicap, in the full game of Go, a feat that was
previously believed to be at least a decade away³,⁷,³¹.


Discussion 

In this work we have developed a Go program, based on a combina- 
tion of deep neural networks and tree search, that plays at the level of 
the strongest human players, thereby achieving one of artificial
intelligence's "grand challenges"³¹⁻³³. We have developed, for the first time,
effective move selection and position evaluation functions for Go,
based on deep neural networks that are trained by a novel combination
of supervised and reinforcement learning. We have introduced a new
search algorithm that successfully combines neural network evalu- 
ations with Monte Carlo rollouts. Our program AlphaGo integrates 
these components together, at scale, in a high-performance tree search 
engine. 

During the match against Fan Hui, AlphaGo evaluated thousands 
of times fewer positions than Deep Blue did in its chess match against 
Kasparov⁴; compensating by selecting those positions more
intelligently, using the policy network, and evaluating them more precisely,
using the value network—an approach that is perhaps closer to how 
humans play. Furthermore, while Deep Blue relied on a handcrafted 
evaluation function, the neural networks of AlphaGo are trained 
directly from gameplay purely through general-purpose supervised 
and reinforcement learning methods. 

Go is exemplary in many ways of the difficulties faced by artificial
intelligence³³,³⁴: a challenging decision-making task, an intractable
search space, and an optimal solution so complex it appears infeasible
to directly approximate using a policy or value function. The previous
major breakthrough in computer Go, the introduction of MCTS, led to
corresponding advances in many other domains; for example, general
game-playing, classical planning, partially observed planning,
scheduling, and constraint satisfaction³⁵,³⁶. By combining tree search with
policy and value networks, AlphaGo has finally reached a professional 
level in Go, providing hope that human-level performance can now be 
achieved in other seemingly intractable artificial intelligence domains. 


Online Content Methods, along with any additional Extended Data display items and 
Source Data, are available in the online version of the paper; references unique to 
these sections appear only in the online paper. 
METHODS


Problem setting. Many games of perfect information, such as chess, checkers,
othello, backgammon and Go, may be defined as alternating Markov games³⁹. In
these games, there is a state space S (where state includes an indication of the
current player to play); an action space A(s) defining the legal actions in any given
state s ∈ S; a state transition function f(s, a, ξ) defining the successor state after
selecting action a in state s and random input ξ (for example, dice); and finally a
reward function r^i(s) describing the reward received by player i in state s. We
restrict our attention to two-player zero-sum games, r¹(s) = −r²(s) = r(s), with
deterministic state transitions, f(s, a, ξ) = f(s, a), and zero rewards except at a
terminal time step T. The outcome of the game z_t = ±r(s_T) is the terminal reward at
the end of the game from the perspective of the current player at time step t.
A policy p(a|s) is a probability distribution over legal actions a ∈ A(s).
A value function is the expected outcome if all actions for both players are selected
according to policy p, that is, v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p]. Zero-sum games have
a unique optimal value function v*(s) that determines the outcome from state s
following perfect play by both players,


$$v^*(s) = \begin{cases} z_T & \text{if } s = s_T, \\ \max_a \, -v^*(f(s, a)) & \text{otherwise} \end{cases}$$


Prior work. The optimal value function can be computed recursively by minimax
(or equivalently negamax) search⁴⁰. Most games are too large for exhaustive
minimax tree search; instead, the game is truncated by using an approximate value
function v(s) ≈ v*(s) in place of terminal rewards. Depth-first minimax search with
alpha-beta pruning⁴⁰ has achieved superhuman performance in chess⁴, checkers⁵
and othello⁶, but it has not been effective in Go⁷.
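
The negamax recursion for v*(s) can be run exhaustively on a game small enough to enumerate. The sketch below uses a subtraction game of our own choosing (players alternately remove one or two stones, and whoever takes the last stone wins); the game and the function name are illustrative and do not come from the paper.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def optimal_value(stones: int) -> int:
    """Optimal value for the player to move: +1 for a forced win, -1 for a forced loss.

    Terminal case: with no stones left, the opponent took the last stone,
    so the player to move has lost.
    Otherwise: v*(s) = max over legal moves a of -v*(f(s, a)).
    """
    if stones == 0:
        return -1
    return max(-optimal_value(stones - take) for take in (1, 2) if take <= stones)

print([optimal_value(n) for n in range(7)])
```

The state is just the stone count because the player to move is implied by alternation; the negation inside the maximum is what switches perspective between the two players. Even this tiny game shows why memoization or truncation is needed as soon as the state space grows.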

Reinforcement learning can learn to approximate the optimal value function
directly from games of self-play³⁹. The majority of prior work has focused on a
linear combination v_θ(s) = φ(s) · θ of features φ(s) with weights θ. Weights were
trained using temporal-difference learning⁴¹ in chess⁴²,⁴³, checkers⁴⁴,⁴⁵ and Go³⁰;
or using linear regression in othello⁶ and Scrabble⁹. Temporal-difference learning
has also been used to train a neural network to approximate the optimal value
function, achieving superhuman performance in backgammon⁴⁶; and
achieving weak kyu-level performance in small-board Go²⁸,²⁹,⁴⁷ using convolutional
networks.

An alternative approach to minimax search is Monte Carlo tree search
(MCTS)¹¹,¹², which estimates the optimal value of interior nodes by a double
approximation, V^n(s) ≈ v^{P^n}(s) ≈ v*(s). The first approximation, V^n(s) ≈ v^{P^n}(s),
uses n Monte Carlo simulations to estimate the value function of a simulation
policy P^n. The second approximation, v^{P^n}(s) ≈ v*(s), uses a simulation policy P^n
in place of minimax optimal actions. The simulation policy selects actions
according to a search control function argmax_a (Q^n(s, a) + u(s, a)), such as UCT¹², that
selects children with higher action values, Q^n(s, a) = −V^n(f(s, a)), plus a bonus
u(s, a) that encourages exploration; or in the absence of a search tree at state s, it
samples actions from a fast rollout policy p_π(a|s). As more simulations are executed
and the search tree grows deeper, the simulation policy becomes informed by
increasingly accurate statistics. In the limit, both approximations become exact
and MCTS (for example, with UCT) converges¹² to the optimal value function
lim_{n→∞} V^n(s) = lim_{n→∞} v^{P^n}(s) = v*(s). The strongest current Go programs are
based on MCTS¹³⁻¹⁵,³⁶.

MCTS has previously been combined with a policy that is used to narrow the
beam of the search tree to high-probability moves¹³; or to bias the bonus term
towards high-probability moves⁴⁸. MCTS has also been combined with a value
function that is used to initialize action values in newly expanded nodes¹⁶, or to
mix Monte Carlo evaluation with minimax evaluation⁴⁹. By contrast, AlphaGo's use
of value functions is based on truncated Monte Carlo search algorithms⁸,⁹, which
terminate rollouts before the end of the game and use a value function in place of
the terminal reward. AlphaGo's position evaluation mixes full rollouts with
truncated rollouts, resembling in some respects the well-known temporal-difference
learning algorithm TD(λ). AlphaGo also differs from prior work by using slower
but more powerful representations of the policy and value function; evaluating
deep neural networks is several orders of magnitude slower than linear
representations and must therefore occur asynchronously.

The performance of MCTS is to a large degree determined by the quality of the 
rollout policy. Prior work has focused on handcrafted patterns⁵⁰ or learning rollout
policies by supervised learning¹³, reinforcement learning¹⁶, simulation
balancing⁵¹,⁵² or online adaptation³⁰,⁵³; however, it is known that rollout-based position
evaluation is frequently inaccurate⁵⁴. AlphaGo uses relatively simple rollouts, and
instead addresses the challenging problem of position evaluation more directly 
using value networks. 


Search algorithm. To efficiently integrate large neural networks into AlphaGo, we 
implemented an asynchronous policy and value MCTS algorithm (APV-MCTS). 
Each node s in the search tree contains edges (s, a) for all legal actions a € A(s). 
Each edge stores a set of statistics, 


$$\{P(s, a),\ N_v(s, a),\ N_r(s, a),\ W_v(s, a),\ W_r(s, a),\ Q(s, a)\}$$


where P(s, a) is the prior probability, W_v(s, a) and W_r(s, a) are Monte Carlo
estimates of total action value, accumulated over N_v(s, a) and N_r(s, a) leaf evaluations
and rollout rewards, respectively, and Q(s, a) is the combined mean action value for
that edge. Multiple simulations are executed in parallel on separate search threads. 
The APV-MCTS algorithm proceeds in the four stages outlined in Fig. 3. 

Selection (Fig. 3a). The first in-tree phase of each simulation begins at the root of 
the search tree and finishes when the simulation reaches a leaf node at time step 
L. At each of these time steps, t < L, an action is selected according to the statistics
in the search tree, a_t = argmax_a (Q(s_t, a) + u(s_t, a)), using a variant of the PUCT
algorithm⁴⁸,

$$u(s, a) = c_{\text{puct}}\, P(s, a)\, \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$

where c_puct is a constant determining
the level of exploration; this search control strategy initially prefers actions with
high prior probability and low visit count, but asymptotically prefers actions with
high action value.

Evaluation (Fig. 3c). The leaf position sz is added to a queue for evaluation vg(sz) 
by the value network, unless it has previously been evaluated. The second rollout 
phase of each simulation begins at leaf node sz and continues until the end of the 
game. At each of these time-steps, t > L, actions are selected by both players accord- 
ing to the rollout policy, a; ~ p, (-|s;). When the game reaches a terminal state, the 
outcome z; = + r(sr) is computed from the final score. 
Backup (Fig. 3d). At each in-tree step t< L of the simulation, the rollout statistics 
are updated as if it has lost ny games, N,(s;, a+) — N,(s;, a1) + ny; W,(s; ar) — W,(s1, 
ai) —nys; this virtual loss” discourages other threads from simultaneously explor- 
ing the identical variation. At the end of the simulation, t he rollout statistics are 
updated in a backward pass through each step t< L, replacing the virtual losses by 
the outcome, N,(s;, a1) — N,(sp ai) ~My + 1; Wilse ai) — Wsp ar) + nv + Ze 
Asynchronously, a separate backward pass is initiated when the evaluation 
of the leaf position sz completes. The output of the value network v(s;) is used to 
update value statistics in a second backward pass through each step t < L, 
N,(sp ai) —N,(s, a) +1, W,(s, ai) — W,(sp ai) + vols). The overall evaluation of 
each state action is a weighted average of the Monte Carlo estimates, 


Wy(s, a) W,(s, a) 
Q(s.a) (1 A) Ny(s,a) + Aidaa) 


rollout evaluations with weighting parameter À. All updates are performed 
lock-free®. 

Expansion (Fig. 3b). When the visit count exceeds a threshold, N,(s, a) > nmr, the 
successor state s’ = f(s, a) is added to the search tree. The new node is initialized 
to {N(s’, a) =N,(s’, a) =0, W(s’, a) = W,(s’, a) =0, P(s',a) =p,(a|s’)}, using a tree 
policy p,(a|s’) (similar to the rollout policy but with more features, see Extended 
Data Table 4) to provide placeholder prior probabilities for action selection. The 
position s’ is also inserted into a queue for asynchronous GPU evaluation by the 
policy network. Prior probabilities are computed by the SL policy network pË (-|s’) 


algorithm’’, u(s, a) = CpuctP(s, a) , Where Cpuct is a constant determining 


, that mixes together the value network and 


with a softmax temperature set to (3; these replace the placeholder prior probabil- 
ities, P(s’, a) — pe (a|s’), using an atomic update. The threshold nr is adjusted 
dynamically to ensure that the rate at which positions are added to the policy queue 
matches the rate at which the GPUs evaluate the policy network. Positions are 
evaluated by both the policy network and the value network using a mini-batch 
size of 1 to minimize end-to-end evaluation time. 
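
The per-edge bookkeeping in the selection and backup stages can be illustrated with a minimal single-threaded sketch. The `Edge` class and the function names are hypothetical, and treating an unvisited estimate as 0 in Q(s, a) is an assumption that the text does not specify; the lock-free, multi-threaded machinery is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics stored on one edge (s, a); the class name is illustrative."""
    P: float          # prior probability P(s, a)
    N_v: int = 0      # value-network evaluations backed up
    N_r: int = 0      # rollouts backed up (including virtual losses)
    W_v: float = 0.0  # total value-network value
    W_r: float = 0.0  # total rollout reward

    def q(self, lam: float) -> float:
        # Assumption: an estimate with zero visits contributes 0.
        q_v = self.W_v / self.N_v if self.N_v else 0.0
        q_r = self.W_r / self.N_r if self.N_r else 0.0
        return (1.0 - lam) * q_v + lam * q_r


def select_action(edges, c_puct, lam):
    """Return argmax_a Q(s, a) + u(s, a) over a dict {action: Edge}."""
    sqrt_total = math.sqrt(sum(e.N_r for e in edges.values()))

    def score(e):
        return e.q(lam) + c_puct * e.P * sqrt_total / (1 + e.N_r)

    return max(edges, key=lambda a: score(edges[a]))


def add_virtual_loss(edge, n_vl):
    """Pretend n_vl rollouts through this edge were lost."""
    edge.N_r += n_vl
    edge.W_r -= n_vl


def backup_rollout(edge, n_vl, z):
    """Replace the virtual loss by the real rollout outcome z."""
    edge.N_r += 1 - n_vl
    edge.W_r += n_vl + z


def backup_value(edge, v):
    """Back up one value-network evaluation v."""
    edge.N_v += 1
    edge.W_v += v
```

With the parameters of Extended Data Table 5 one would call `select_action(edges, c_puct=5, lam=0.5)` and use `n_vl=3`.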

We also implemented a distributed APV-MCTS algorithm. This architecture 
consists of a single master machine that executes the main search, many remote 
worker CPUs that execute asynchronous rollouts, and many remote worker GPUs 
that execute asynchronous policy and value network evaluations. The entire search 
tree is stored on the master, which only executes the in-tree phase of each simu- 
lation. The leaf positions are communicated to the worker CPUs, which execute 
the rollout phase of simulation, and to the worker GPUs, which compute network 
features and evaluate the policy and value networks. The prior probabilities of the 
policy network are returned to the master, where they replace placeholder prior 
probabilities at the newly expanded node. The rewards from rollouts and the value 
network outputs are each returned to the master, and backed up the originating 
search path. 

At the end of search AlphaGo selects the action with maximum visit count; this
is less sensitive to outliers than maximizing action value. The search tree is reused
at subsequent time steps: the child node corresponding to the played action
becomes the new root node; the subtree below this child is retained along with all
its statistics, while the remainder of the tree is discarded. The match version of
AlphaGo continues searching during the opponent’s move. It extends the search
if the action maximizing visit count and the action maximizing action value disagree.
Time controls were otherwise shaped to use most time in the middle-game.
AlphaGo resigns when its overall evaluation drops below an estimated 10% probability
of winning the game, that is, $\max_a Q(s,a) < -0.8$.
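
The resignation threshold follows from outcomes being scored as ±1: an expected outcome Q corresponds to a win probability of (Q + 1)/2, so Q = −0.8 maps to 10%. A small sketch of the final move choice under that reading follows; the function names and the dictionary inputs are illustrative, not part of the original system.

```python
def win_probability(q):
    """Map an expected outcome in [-1, 1] to a win probability in [0, 1]."""
    return (q + 1.0) / 2.0


def choose_move(visit_counts, q_values, resign_q=-0.8):
    """Return the most-visited action, or None to signal resignation.

    visit_counts and q_values are dicts keyed by action (hypothetical inputs).
    """
    if max(q_values.values()) < resign_q:
        return None
    return max(visit_counts, key=visit_counts.get)
```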

AlphaGo does not employ the all-moves-as-first or rapid action value estimation
heuristics used in the majority of Monte Carlo Go programs; when using
policy networks as prior knowledge, these biased heuristics do not appear to give
any additional benefit. In addition AlphaGo does not use progressive widening,
dynamic komi or an opening book. The parameters used by AlphaGo in the
Fan Hui match are listed in Extended Data Table 5.
Rollout policy. The rollout policy $p_\pi(a|s)$ is a linear softmax policy based on fast,
incrementally computed, local pattern-based features consisting of both ‘response’
patterns around the previous move that led to state s, and ‘non-response’ patterns
around the candidate move a in state s. Each non-response pattern is a binary
feature matching a specific 3 × 3 pattern centred on a, defined by the colour (black,
white, empty) and liberty count (1, 2, ≥3) for each adjacent intersection. Each
response pattern is a binary feature matching the colour and liberty count in a
12-point diamond-shaped pattern centred around the previous move.
Additionally, a small number of handcrafted local features encode common-sense
Go rules (see Extended Data Table 4). Similar to the policy network, the weights
$\pi$ of the rollout policy are trained from 8 million positions from human games on
the Tygem server to maximize log likelihood by stochastic gradient descent.
Rollouts execute at approximately 1,000 simulations per second per CPU thread
on an empty board.
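
A linear softmax over binary features can be sketched as follows. The data layout (a dict mapping each candidate move to the indices of the features that fire for it) is an assumption made for illustration; the actual feature extraction is incremental and is not shown.

```python
import math


def rollout_policy(active_features, weights):
    """Linear softmax over binary features.

    active_features: {move: [indices of features equal to 1]} (hypothetical layout)
    weights: list of floats, one weight per feature
    Returns {move: probability}.
    """
    logits = {m: sum(weights[i] for i in feats)
              for m, feats in active_features.items()}
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {m: math.exp(v - top) for m, v in logits.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}
```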

Our rollout policy $p_\pi(a|s)$ contains less handcrafted knowledge than state-of-the-art
Go programs. Instead, we exploit the higher-quality action selection
within MCTS, which is informed both by the search tree and the policy network.
We introduce a new technique that caches all moves from the search tree and
then plays similar moves during rollouts; a generalization of the ‘last good reply’
heuristic. At every step of the tree traversal, the most probable action is inserted
into a hash table, along with the 3 x 3 pattern context (colour, liberty and stone 
counts) around both the previous move and the current move. At each step of the 
rollout, the pattern context is matched against the hash table; if a match is found 
then the stored move is played with high probability. 
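
The caching idea can be illustrated with a small hash-table sketch. The class name, the form of the context key and the value 0.9 for "high probability" are all assumptions; the text does not give these details.

```python
import random


class MoveCache:
    """Hypothetical cache of tree moves keyed by local pattern context."""

    def __init__(self, play_prob=0.9, seed=None):
        self.table = {}
        self.play_prob = play_prob      # assumed value for "high probability"
        self.rng = random.Random(seed)

    def store(self, context, move):
        """Record the most probable tree action for this pattern context."""
        self.table[context] = move

    def suggest(self, context, legal_moves):
        """Return the cached move with probability play_prob, else None."""
        move = self.table.get(context)
        if (move is not None and move in legal_moves
                and self.rng.random() < self.play_prob):
            return move
        return None
```

When `suggest` returns `None`, a rollout would fall back to sampling from the rollout policy.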

Symmetries. In previous work, the symmetries of Go have been exploited by using
rotationally and reflectionally invariant filters in the convolutional layers.
Although this may be effective in small neural networks, it actually hurts performance
in larger networks, as it prevents the intermediate filters from identifying
specific asymmetric patterns. Instead, we exploit symmetries at run-time by
dynamically transforming each position s using the dihedral group of eight reflections
and rotations, $d_1(s), \ldots, d_8(s)$. In an explicit symmetry ensemble, a mini-batch
of all 8 positions is passed into the policy network or value network and computed
in parallel. For the value network, the output values are simply averaged,
$\bar{v}_\theta(s) = \frac{1}{8}\sum_{j=1}^{8} v_\theta(d_j(s))$. For the policy network, the planes of output probabilities
are rotated/reflected back into the original orientation, and averaged together to
provide an ensemble prediction, $\bar{p}_\sigma(\cdot|s) = \frac{1}{8}\sum_{j=1}^{8} d_j^{-1}(p_\sigma(\cdot|d_j(s)))$; this approach
was used in our raw network evaluation (see Extended Data Table 3). Instead,
APV-MCTS makes use of an implicit symmetry ensemble that randomly selects a
single rotation/reflection $j \in [1, 8]$ for each evaluation. We compute exactly one
evaluation for that orientation only; in each simulation we compute the value
of leaf node $s_L$ by $v_\theta(d_j(s_L))$, and allow the search procedure to average over
these evaluations. Similarly, we compute the policy network for a single,
randomly selected rotation/reflection, $d_j^{-1}(p_\sigma(\cdot|d_j(s)))$.
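
The explicit ensemble can be sketched on plain nested lists. The indexing of the eight transforms (0 to 7, with a reflection applied first for indices 4 to 7) and the callables `value_fn` and `policy_fn` are illustrative choices, not the original implementation.

```python
def rot90(board):
    """Rotate a square board 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*board)][::-1]


def flip(board):
    """Reflect a board left to right."""
    return [row[::-1] for row in board]


def transform(board, j):
    """Apply dihedral transform j in 0..7: optional flip, then j % 4 rotations."""
    if j >= 4:
        board = flip(board)
    for _ in range(j % 4):
        board = rot90(board)
    return board


def inverse_transform(board, j):
    """Undo transform j: rotate back, then undo the flip."""
    for _ in range((4 - j % 4) % 4):
        board = rot90(board)
    if j >= 4:
        board = flip(board)
    return board


def ensemble_value(value_fn, board):
    """Average a scalar evaluation over all eight orientations."""
    return sum(value_fn(transform(board, j)) for j in range(8)) / 8.0


def ensemble_policy(policy_fn, board):
    """Average policy planes after mapping each back to the original orientation."""
    n = len(board)
    avg = [[0.0] * n for _ in range(n)]
    for j in range(8):
        plane = inverse_transform(policy_fn(transform(board, j)), j)
        for r in range(n):
            for c in range(n):
                avg[r][c] += plane[r][c] / 8.0
    return avg
```

The implicit ensemble used inside the search would instead draw a single `j` at random per evaluation and call `transform` and `inverse_transform` once.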

Policy network: classification. We trained the policy network $p_\sigma$ to classify positions
according to expert moves played in the KGS data set. This data set contains
29.4 million positions from 160,000 games played by KGS 6 to 9 dan human players;
35.4% of the games are handicap games. The data set was split into a test set
(the first million positions) and a training set (the remaining 28.4 million positions).
Pass moves were excluded from the data set. Each position consisted of a
raw board description s and the move a selected by the human. We augmented the
data set to include all eight reflections and rotations of each position. Symmetry
augmentation and input features were pre-computed for each position. For each
training step, we sampled a randomly selected mini-batch of m samples from
the augmented KGS data set, $\{s^k, a^k\}_{k=1}^{m}$, and applied an asynchronous stochastic
gradient descent update to maximize the log likelihood of the action,
$\Delta\sigma = \frac{\alpha}{m}\sum_{k=1}^{m} \frac{\partial \log p_\sigma(a^k|s^k)}{\partial\sigma}$. The step size $\alpha$ was initialized to 0.003 and was halved
every 80 million training steps, without momentum terms, and a mini-batch size
of m = 16. Updates were applied asynchronously on 50 GPUs using DistBelief;
gradients older than 100 steps were discarded. Training took around 3 weeks for
340 million training steps.
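
As a worked illustration of the schedule, the sketch below computes the step size at a given training step and applies the staleness rule. The function names are hypothetical, and reading "older than 100 steps" as a staleness strictly greater than 100 is an assumption.

```python
def step_size(step, initial=0.003, halve_every=80_000_000):
    """Step size after `step` training steps, halved every `halve_every` steps."""
    return initial * 0.5 ** (step // halve_every)


def accept_gradient(current_step, computed_at_step, max_staleness=100):
    """Keep a gradient unless it is more than `max_staleness` steps old."""
    return current_step - computed_at_step <= max_staleness
```

Under this reading, the step size has been halved four times by step 340 million, giving 0.003/16.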

Policy network: reinforcement learning. We further trained the policy network
by policy gradient reinforcement learning. Each iteration consisted of a mini-batch
of n games played in parallel, between the current policy network $p_\rho$ that is
being trained, and an opponent $p_{\rho^-}$ that uses parameters $\rho^-$ from a previous iteration,
randomly sampled from a pool of opponents, so as to increase the stability
of training. Weights were initialized to $\rho = \rho^- = \sigma$. Every 500 iterations, we added
the current parameters $\rho$ to the opponent pool. Each game i in the mini-batch was
played out until termination at step $T^i$, and then scored to determine the outcome
$z_t^i = \pm r(s_{T^i})$ from each player’s perspective. The games were then replayed to
determine the policy gradient update,

$$\Delta\rho = \frac{\alpha}{n}\sum_{i=1}^{n}\sum_{t=1}^{T^i} \frac{\partial \log p_\rho(a_t^i|s_t^i)}{\partial\rho}\,\bigl(z_t^i - v(s_t^i)\bigr),$$

using the REINFORCE algorithm with baseline $v(s_t^i)$ for variance reduction. On
the first pass through the training pipeline, the baseline was set to zero; on the
second pass we used the value network $v_\theta(s)$ as a baseline; this provided a small
performance boost. The policy network was trained in this way for 10,000 mini-batches
of 128 games, using 50 GPUs, for one day.
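
The REINFORCE update can be made concrete on a toy tabular softmax policy, where the parameters are one vector of logits per state and the gradient of log p(a|s) with respect to logit b is 1[a = b] − p(b|s). This parametrization and the function names are assumptions for illustration only; the actual policy is a deep convolutional network.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(rho, games, alpha, baseline=lambda s: 0.0):
    """Return the update for tabular logits rho = {state: [logit per action]}.

    games: list of games, each a list of (state, action, z) with z = +1 or -1
    from the perspective of the player who moved at that step.
    """
    n = len(games)
    delta = {s: [0.0] * len(logits) for s, logits in rho.items()}
    for game in games:
        for s, a, z in game:
            p = softmax(rho[s])
            advantage = z - baseline(s)
            for b in range(len(p)):
                grad_log_p = (1.0 if b == a else 0.0) - p[b]
                delta[s][b] += alpha / n * grad_log_p * advantage
    return delta
```

Passing a value estimate as `baseline` corresponds to the second pass described above; the default of zero corresponds to the first pass.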

Value network: regression. We trained a value network $v_\theta(s) \approx v^{p_\rho}(s)$ to approximate
the value function of the RL policy network $p_\rho$. To avoid overfitting to the
strongly correlated positions within games, we constructed a new data set of uncorrelated
self-play positions. This data set consisted of over 30 million positions, each
drawn from a unique game of self-play. Each game was generated in three phases
by randomly sampling a time step U ~ unif{1, 450}, and sampling the first
t = 1, …, U − 1 moves from the SL policy network, $a_t \sim p_\sigma(\cdot|s_t)$; then sampling one move
uniformly at random from available moves, $a_U \sim$ unif{1, 361} (repeatedly until
$a_U$ is legal); then sampling the remaining sequence of moves until the game terminates,
t = U + 1, …, T, from the RL policy network, $a_t \sim p_\rho(\cdot|s_t)$. Finally, the game
is scored to determine the outcome $z_t = \pm r(s_T)$. Only a single training example
$(s_{U+1}, z_{U+1})$ is added to the data set from each game. This data provides unbiased
samples of the value function $v^{p_\rho}(s_{U+1}) = \mathbb{E}[z_{U+1} \mid s_{U+1}, a_{U+1,\ldots,T} \sim p_\rho]$. During
the first two phases of generation we sample from noisier distributions so as
to increase the diversity of the data set. The training method was identical
to SL policy network training, except that the parameter update was based on
mean squared error between the predicted values and the observed rewards,
$\Delta\theta = \frac{\alpha}{m}\sum_{k=1}^{m} \bigl(z^k - v_\theta(s^k)\bigr) \frac{\partial v_\theta(s^k)}{\partial\theta}$. The value network was trained for 50 million
mini-batches of 32 positions, using 50 GPUs, for one week.
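
The three-phase generation of one training example can be sketched as below. `ToyGame` is a hypothetical stand-in for a Go engine, its interface (`is_over`, `is_legal`, `play`, `outcome_for`, `to_play`) is invented for illustration, and discarding games that end before step U is an assumption not stated in the text.

```python
import random


class ToyGame:
    """Hypothetical stand-in for Go: ends after `length` moves;
    player 0 wins if the sum of all moves is even."""

    def __init__(self, length=10, moves=()):
        self.length, self.moves = length, tuple(moves)

    @property
    def to_play(self):
        return len(self.moves) % 2

    def is_over(self):
        return len(self.moves) >= self.length

    def is_legal(self, move):
        return 1 <= move <= 361

    def play(self, move):
        return ToyGame(self.length, self.moves + (move,))

    def outcome_for(self, player):
        first_player_wins = sum(self.moves) % 2 == 0
        return 1 if first_player_wins == (player == 0) else -1


def generate_example(new_game, sl_policy, rl_policy, rng, max_u=450, n_points=361):
    """Return (s_{U+1}, z_{U+1}) for one self-play game, or None if it ends early."""
    state = new_game()
    u = rng.randint(1, max_u)
    for _ in range(u - 1):                 # phase 1: moves 1..U-1 from the SL policy
        if state.is_over():
            return None
        state = state.play(sl_policy(state, rng))
    if state.is_over():
        return None
    while True:                            # phase 2: one uniformly random legal move
        move = rng.randint(1, n_points)
        if state.is_legal(move):
            break
    state = state.play(move)
    sample = state                         # this is s_{U+1}
    while not state.is_over():             # phase 3: RL policy until termination
        state = state.play(rl_policy(state, rng))
    return sample, state.outcome_for(sample.to_play)
```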

Features for policy/value network. Each position s was pre-processed into a set
of 19 × 19 feature planes. The features that we use come directly from the raw
representation of the game rules, indicating the status of each intersection of the
Go board: stone colour, liberties (adjacent empty points of stone's chain), captures,
legality, turns since stone was played, and (for the value network only) the current
colour to play. In addition, we use one simple tactical feature that computes the
outcome of a ladder search. All features were computed relative to the current
colour to play; for example, the stone colour at each intersection was represented
as either player or opponent rather than black or white. Each integer feature value
is split into multiple 19 × 19 planes of binary values (one-hot encoding). For example,
separate binary feature planes are used to represent whether an intersection
has 1 liberty, 2 liberties, …, ≥8 liberties. The full set of feature planes is listed in
Extended Data Table 2.
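
The one-hot split of an integer feature can be sketched as follows. Using 0 to mean "feature absent at this point" is an assumption, and `PLANE_COUNTS` simply restates the plane counts of Extended Data Table 2 for the policy network inputs.

```python
def one_hot_planes(values, n_planes=8):
    """Split a 2D grid of integer feature values into binary planes.

    Plane k-1 marks points whose value equals k; the last plane marks
    values >= n_planes. A value of 0 (assumed: feature absent) sets no plane.
    """
    size = len(values)
    planes = [[[0] * size for _ in range(size)] for _ in range(n_planes)]
    for r, row in enumerate(values):
        for c, v in enumerate(row):
            if v >= 1:
                planes[min(v, n_planes) - 1][r][c] = 1
    return planes


# Plane counts from Extended Data Table 2, excluding the value-network-only
# "player color" plane.
PLANE_COUNTS = {
    "stone colour": 3, "ones": 1, "turns since": 8, "liberties": 8,
    "capture size": 8, "self-atari size": 8, "liberties after move": 8,
    "ladder capture": 1, "ladder escape": 1, "sensibleness": 1, "zeros": 1,
}
```

The counts sum to the 48 planes of the policy network input; the extra player-colour plane gives the value network 49.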

Neural network architecture. The input to the policy network is a 19 x 19 x 48 
image stack consisting of 48 feature planes. The first hidden layer zero pads the 
input into a 23 x 23 image, then convolves k filters of kernel size 5 x 5 with stride 
1 with the input image and applies a rectifier nonlinearity. Each of the subsequent 
hidden layers 2 to 12 zero pads the respective previous hidden layer into a 21 x 21 
image, then convolves k filters of kernel size 3 x 3 with stride 1, again followed 
by a rectifier nonlinearity. The final layer convolves 1 filter of kernel size 1 x 1 
with stride 1, with a different bias for each position, and applies a softmax func- 
tion. The match version of AlphaGo used k= 192 filters; Fig. 2b and Extended 
Data Table 3 additionally show the results of training with k = 128, 256 and 
384 filters. 

The input to the value network is also a 19 x 19 x 48 image stack, with an addi- 
tional binary feature plane describing the current colour to play. Hidden layers 2 to 
11 are identical to the policy network, hidden layer 12 is an additional convolution 
layer, hidden layer 13 convolves 1 filter of kernel size 1 x 1 with stride 1, and hidden 
layer 14 is a fully connected linear layer with 256 rectifier units. The output layer 
is a fully connected linear layer with a single tanh unit. 
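
The padding arithmetic of the policy network can be checked with a short sketch: padding 19 to 23 before a 5 × 5 convolution, or 19 to 21 before a 3 × 3 convolution, keeps every layer at 19 × 19. The helper names are hypothetical, and the weight count covers convolution kernels only, because the text does not specify biases for the hidden layers.

```python
def conv_output_size(size, pad, kernel, stride=1):
    """Return (padded size, output size) for a square convolution."""
    padded = size + 2 * pad
    return padded, (padded - kernel) // stride + 1


def policy_conv_weights(k=192, planes=48, hidden_3x3_layers=11):
    """Count convolution kernel weights in the policy network (biases excluded)."""
    first = 5 * 5 * planes * k                  # hidden layer 1
    middle = hidden_3x3_layers * 3 * 3 * k * k  # hidden layers 2 to 12
    last = 1 * 1 * k * 1                        # final 1 x 1 convolution
    return first + middle + last
```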

Evaluation. We evaluated the relative strength of computer Go programs by running
an internal tournament and measuring the Elo rating of each program. We
estimate the probability that program a will beat program b by a logistic function
$p(a \text{ beats } b) = \frac{1}{1 + \exp(c_{elo}(e(b) - e(a)))}$, and estimate the ratings $e(\cdot)$ by Bayesian
logistic regression, computed by the BayesElo program using the standard
constant $c_{elo} = 1/400$. The scale was anchored to the BayesElo rating of professional
Go player Fan Hui (2,908 at date of submission). All programs received a maximum
of 5 s computation time per move; games were scored using Chinese rules
with a komi of 7.5 points (extra points to compensate white for playing second).
We also played handicap games where AlphaGo played white against existing Go
programs; for these games we used a non-standard handicap system in which komi
was retained but black was given additional stones on the usual handicap points.
Using these rules, a handicap of K stones is equivalent to giving K − 1 free moves
to black, rather than K − 1/2 free moves using standard no-komi handicap rules.
We used these handicap rules because AlphaGo’s value network was trained specifically
to use a komi of 7.5.
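
The logistic model can be written directly from the formula as printed above. This sketch follows that formula with the natural exponential; the conventional Elo scale uses base 10 instead, which gives a different curve, so the function should be read as an illustration rather than a reproduction of BayesElo.

```python
import math

C_ELO = 1.0 / 400.0


def p_beats(e_a, e_b, c_elo=C_ELO):
    """Probability that a program rated e_a beats one rated e_b."""
    return 1.0 / (1.0 + math.exp(c_elo * (e_b - e_a)))
```

Equal ratings give a probability of 0.5, and the probabilities for the two programs always sum to 1.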

With the exception of distributed AlphaGo, each computer Go program was 
executed on its own single machine, with identical specifications, using the latest 
available version and the best hardware configuration supported by that program 
(see Extended Data Table 6). In Fig. 4, approximate ranks of computer programs 
are based on the highest KGS rank achieved by that program; however, the KGS 
version may differ from the publicly available version. 

The match against Fan Hui was arbitrated by an impartial referee. Five 
formal games and five informal games were played with 7.5 komi, no handi- 
cap, and Chinese rules. AlphaGo won these games 5-0 and 3-2 respectively 
(Fig. 6 and Extended Data Table 1). Time controls for formal games were 1h main 
time plus three periods of 30s byoyomi. Time controls for informal games were 
three periods of 30s byoyomi. Time controls and playing conditions were chosen 
by Fan Hui in advance of the match; it was also agreed that the overall match 
outcome would be determined solely by the formal games. To approximately 
assess the relative rating of Fan Hui to computer Go programs, we appended the 
results of all ten games to our internal tournament results, ignoring differences 
in time controls. 

Extended Data Table 1 | Details of match between AlphaGo and Fan Hui

Date      Black     White     Category   Result
5/10/15   Fan Hui   AlphaGo   Formal     AlphaGo wins by 2.5 points
5/10/15   Fan Hui   AlphaGo   Informal   Fan Hui wins by resignation
6/10/15   AlphaGo   Fan Hui   Formal     AlphaGo wins by resignation
6/10/15   AlphaGo   Fan Hui   Informal   AlphaGo wins by resignation
7/10/15   Fan Hui   AlphaGo   Formal     AlphaGo wins by resignation
7/10/15   Fan Hui   AlphaGo   Informal   AlphaGo wins by resignation
8/10/15   AlphaGo   Fan Hui   Formal     AlphaGo wins by resignation
8/10/15   AlphaGo   Fan Hui   Informal   AlphaGo wins by resignation
9/10/15   Fan Hui   AlphaGo   Formal     AlphaGo wins by resignation
9/10/15   AlphaGo   Fan Hui   Informal   Fan Hui wins by resignation

The match consisted of five formal games with longer time controls, and five informal games with shorter time controls.
Time controls and playing conditions were chosen by Fan Hui in advance of the match.

Extended Data Table 2 | Input features for neural networks

Feature                # of planes   Description
Stone colour           3             Player stone / opponent stone / empty
Ones                   1             A constant plane filled with 1
Turns since            8             How many turns since a move was played
Liberties              8             Number of liberties (empty adjacent points)
Capture size           8             How many opponent stones would be captured
Self-atari size        8             How many of own stones would be captured
Liberties after move   8             Number of liberties after this move is played
Ladder capture         1             Whether a move at this point is a successful ladder capture
Ladder escape          1             Whether a move at this point is a successful ladder escape
Sensibleness           1             Whether a move is legal and does not fill its own eyes
Zeros                  1             A constant plane filled with 0
Player color           1             Whether current player is black

Feature planes used by the policy network (all but last feature) and value network (all features).

Extended Data Table 3 | Supervised learning results for the policy network

Architecture                        Evaluation
Filters   Symmetries   Features     Test accuracy %   Train accuracy %   Raw net wins %   AlphaGo wins %   Forward time (ms)
128       1            48           54.6              57.0               36               53               2.8
192       1            48           55.4              58.0               50               50               4.8
256       1            48           55.9              59.1               67               55               7.1
256       2            48           56.5              59.8               67               38               13.9
256       4            48           56.9              60.2               69               14               27.6
256       8            48           57.0              60.4               69               5                55.3
192       1            4            47.6              51.4               25               15               4.8
192       1            12           54.7              57.1               30               34               4.8
192       1            20           54.7              57.2               38               40               4.8
192       8            4            49.2              53.2               24               2                36.8
192       8            12           55.7              58.3               32               3                36.8
192       8            20           55.8              58.4               42               3                36.8

The policy network architecture consists of 128, 192 or 256 filters in convolutional layers; an explicit symmetry ensemble over 2, 4 or 8 symmetries; using only the first 4, 12 or
20 input feature planes listed in Extended Data Table 2. The results consist of the test and train accuracy on the KGS data set; and the percentage of games won by given policy
network against AlphaGo’s policy network (highlighted row 2): using the policy networks to select moves directly (raw wins); or using AlphaGo’s search to select moves (AlphaGo
wins); and finally the computation time for a single evaluation of the policy network.

Extended Data Table 4 | Input features for rollout and tree policy

Feature                # of patterns   Description
Response               1               Whether move matches one or more response pattern features
Save atari             1               Move saves stone(s) from capture
Neighbour              8               Move is 8-connected to previous move
Nakade                 8192            Move matches a nakade pattern at captured stone
Response pattern       32207           Move matches 12-point diamond pattern near previous move
Non-response pattern   69338           Move matches 3 × 3 pattern around move

Self-atari             1               Move allows stones to be captured
Last move distance     34              Manhattan distance to previous two moves
Non-response pattern   32207           Move matches 12-point diamond pattern centred around move

Features used by the rollout policy (first set) and tree policy (first and second set). Patterns are based on stone colour (black/white/empty) and liberties (1, 2, ≥3)
at each intersection of the pattern.

Extended Data Table 5 | Parameters used by AlphaGo

Symbol       Parameter              Value
$\beta$      Softmax temperature    0.67
$\lambda$    Mixing parameter       0.5
$n_{vl}$     Virtual loss           3
$n_{thr}$    Expansion threshold    40
$c_{puct}$   Exploration constant   5

Extended Data Table 6 | Results of a tournament between different Go programs

Short name         Computer Player       Version             Time settings   CPUs   GPUs   KGS Rank   Elo
$\alpha^d_{rvp}$   Distributed AlphaGo   See Methods         5 seconds       1202   176    -          3140
$\alpha_{rvp}$     AlphaGo               See Methods         5 seconds       48     8      -          2890
CS                 CrazyStone            2015                5 seconds       32     -      6d         1929
ZN                 Zen                   5                   5 seconds       8      -      6d         1888
PC                 Pachi                 10.99               400,000 sims    16     -      2d         1298
FG                 Fuego                 svn1989             100,000 sims    16     -      -          1148
GG                 GnuGo                 3.8                 level 10        1      -      5k         431
CS4                CrazyStone            4 handicap stones   5 seconds       32     -      -          2526
ZN4                Zen                   4 handicap stones   5 seconds       8      -      -          2413
PC4                Pachi                 4 handicap stones   400,000 sims    16     -      -          1756

Each program played with a maximum of 5 s thinking time per move; the games against Fan Hui were conducted using longer time controls, as described in Methods. CS4, ZN4
and PC4 were given 4 handicap stones; komi was 7.5 in all games. Elo ratings were computed by BayesElo.

Extended Data Table 7 | Results of a tournament between different variants of AlphaGo

Short name       Policy network   Value network   Rollouts   Mixing constant   Policy GPUs   Value GPUs   Elo rating
$\alpha_{rvp}$   $p_\sigma$       $v_\theta$      $p_\pi$    $\lambda = 0.5$   2             6            2890
$\alpha_{vp}$    $p_\sigma$       $v_\theta$      -          $\lambda = 0$     2             6            2177
$\alpha_{rp}$    $p_\sigma$       -               $p_\pi$    $\lambda = 1$     8             0            2416
$\alpha_{rv}$    [$p_\tau$]       $v_\theta$      $p_\pi$    $\lambda = 0.5$   0             8            2077
$\alpha_{v}$     [$p_\tau$]       $v_\theta$      -          $\lambda = 0$     0             8            1655
$\alpha_{r}$     [$p_\tau$]       -               $p_\pi$    $\lambda = 1$     0             0            1457
$\alpha_{p}$     $p_\sigma$       -               -          -                 0             0            1517

Evaluating positions using rollouts only ($\alpha_{rp}$, $\alpha_{r}$), value nets only ($\alpha_{vp}$, $\alpha_{v}$), or mixing both ($\alpha_{rvp}$, $\alpha_{rv}$); either using the policy network $p_\sigma$ ($\alpha_{rvp}$, $\alpha_{vp}$, $\alpha_{rp}$), or no policy
network ($\alpha_{rv}$, $\alpha_{v}$, $\alpha_{r}$), that is, instead using the placeholder probabilities from the tree policy $p_\tau$ throughout. Each program used 5 s per move on a single machine
with 48 CPUs and 8 GPUs. Elo ratings were computed by BayesElo.

Extended Data Table 8 | Results of a tournament between AlphaGo and distributed AlphaGo, testing scalability with hardware

AlphaGo        Search threads   CPUs   GPUs   Elo
Asynchronous   1                48     8      2203
Asynchronous   2                48     8      2393
Asynchronous   4                48     8      2564
Asynchronous   8                48     8      2665
Asynchronous   16               48     8      2778
Asynchronous   32               48     8      2867
Asynchronous   40               48     8      2890
Asynchronous   40               48     1      2181
Asynchronous   40               48     2      2738
Asynchronous   40               48     4      2850
Distributed    12               428    64     2937
Distributed    24               764    112    3079
Distributed    40               1202   176    3140
Distributed    64               1920   280    3168

Each program played with a maximum of 2 s thinking time per move. Elo ratings were computed by BayesElo.

Extended Data Table 9 | Cross-table of win rates in per cent between programs

                 $\alpha_{rvp}$   $\alpha_{vp}$    $\alpha_{rp}$    $\alpha_{rv}$    $\alpha_{r}$     $\alpha_{v}$     $\alpha_{p}$
$\alpha_{rvp}$   -                1 [0; 5]         5 [4; 7]         0 [0; 4]         0 [0; 8]         0 [0; 19]        0 [0; 19]
$\alpha_{vp}$    99 [95; 100]     -                61 [52; 69]      35 [25; 48]      6 [1; 27]        0 [0; 22]        1 [0; 6]
$\alpha_{rp}$    95 [93; 96]      39 [31; 48]      -                13 [7; 23]       0 [0; 9]         0 [0; 22]        4 [1; 21]
$\alpha_{rv}$    100 [96; 100]    65 [52; 75]      87 [77; 93]      -                0 [0; 18]        29 [8; 64]       48 [33; 65]
$\alpha_{r}$     100 [92; 100]    94 [73; 99]      100 [91; 100]    100 [82; 100]    -                78 [45; 94]      78 [71; 84]
$\alpha_{v}$     100 [81; 100]    100 [78; 100]    100 [78; 100]    71 [36; 92]      22 [6; 55]       -                30 [16; 48]
$\alpha_{p}$     100 [81; 100]    99 [94; 100]     96 [79; 99]      52 [35; 67]      22 [16; 29]      70 [52; 84]      -
CS               100 [97; 100]    74 [66; 81]      98 [94; 99]      80 [70; 87]      5 [3; 7]         36 [16; 61]      8 [5; 14]
ZN               99 [93; 100]     84 [67; 93]      98 [93; 99]      92 [67; 99]      6 [2; 19]        40 [12; 77]      100 [65; 100]
PC               100 [98; 100]    99 [95; 100]     100 [98; 100]    98 [89; 100]     78 [73; 81]      87 [68; 95]      55 [47; 62]
FG               100 [97; 100]    99 [93; 100]     100 [96; 100]    100 [91; 100]    78 [73; 83]      100 [65; 100]    65 [55; 73]
GG               100 [44; 100]    100 [34; 100]    100 [68; 100]    100 [57; 100]    99 [97; 100]     67 [21; 94]      99 [95; 100]
CS4              77 [69; 84]      12 [8; 18]       53 [44; 61]      15 [8; 24]       0 [0; 3]         0 [0; 30]        0 [0; 8]
ZN4              86 [77; 92]      25 [16; 38]      67 [56; 76]      14 [7; 27]       0 [0; 12]        0 [0; 43]        -
PC4              99 [97; 100]     82 [75; 88]      98 [95; 99]      89 [79; 95]      32 [26; 39]      13 [3; 36]       35 [25; 46]

95% Agresti-Coull confidence intervals in brackets. Each program played with a maximum of 5 s thinking time per move. CS4, ZN4 and PC4 were given 4 handicap stones;
komi was 7.5 in all games. Distributed AlphaGo scored 77% [70; 82] against $\alpha_{rvp}$ and 100% against all other programs (no handicap games were played).

Extended Data Table 10 | Cross-table of win rates in per cent between programs in the single-machine scalability study

          Threads   1             2             4             8             16            32             40            40             40            40
          GPU       8             8             8             8             8             8              8             4              2             1
Threads   GPU
1         8         -             70 [61; 78]   90 [84; 94]   94 [83; 98]   86 [72; 94]   98 [91; 100]   98 [92; 99]   100 [76; 100]  96 [91; 98]   38 [25; 52]
2         8         30 [22; 39]   -             72 [61; 81]   81 [71; 88]   86 [76; 93]   92 [83; 97]    93 [86; 96]   83 [69; 91]    84 [75; 90]   26 [17; 38]
4         8         10 [6; 16]    28 [19; 39]   -             62 [53; 70]   71 [61; 80]   82 [71; 89]    84 [74; 90]   81 [69; 89]    78 [63; 88]   18 [10; 28]
8         8         6 [2; 17]     19 [12; 29]   38 [30; 47]   -             61 [51; 71]   65 [51; 76]    73 [62; 82]   74 [59; 85]    64 [55; 73]   12 [3; 34]
16        8         14 [6; 28]    14 [7; 24]    29 [20; 39]   39 [29; 49]   -             52 [41; 63]    61 [50; 71]   52 [41; 64]    41 [32; 51]   5 [1; 25]
32        8         2 [0; 9]      8 [3; 17]     18 [11; 29]   35 [24; 49]   48 [37; 59]   -              52 [42; 63]   44 [32; 57]    26 [17; 36]   0 [0; 30]
40        8         2 [1; 8]      8 [4; 14]     16 [10; 26]   27 [18; 38]   39 [29; 50]   48 [37; 58]    -             43 [30; 56]    41 [26; 58]   4 [1; 18]
40        4         0 [0; 24]     17 [9; 31]    19 [11; 31]   26 [15; 41]   48 [36; 59]   56 [43; 68]    57 [44; 70]   -              29 [18; 41]   2 [0; 11]
40        2         4 [2; 9]      16 [10; 25]   22 [12; 37]   36 [27; 45]   59 [49; 68]   74 [64; 83]    59 [42; 74]   71 [59; 82]    -             5 [1; 17]
40        1         62 [48; 75]   74 [62; 83]   82 [72; 90]   88 [66; 97]   95 [75; 99]   100 [70; 100]  96 [82; 99]   98 [89; 100]   95 [83; 99]   -

95% Agresti-Coull confidence intervals in brackets. Each program played with 2 s per move; komi was 7.5 in all games.

Extended Data Table 11 | Cross-table of win rates in per cent between programs in the distributed scalability study

                 Threads   40            12            24            40            64
                 GPU       8             64            112           176           280
                 CPU       48            428           764           1202          1920
Threads   GPU    CPU
40        8      48        -             52 [43; 61]   68 [59; 76]   77 [70; 82]   81 [65; 91]
12        64     428       48 [39; 57]   -             64 [54; 73]   62 [41; 79]   83 [55; 95]
24        112    764       32 [24; 41]   36 [27; 46]   -             36 [20; 57]   60 [51; 69]
40        176    1202      23 [18; 30]   38 [21; 59]   64 [43; 80]   -             53 [39; 67]
64        280    1920      19 [9; 35]    17 [5; 45]    40 [31; 49]   47 [33; 61]   -

95% Agresti-Coull confidence intervals in brackets. Each program played with 2 s per move; komi was 7.5 in all games.